Evaluating Emotional Nuances in Dialogue Summarization
Automatic dialogue summarization is a well-established task that aims to
identify the most important content from human conversations to create a short
textual summary. Despite recent progress in the field, we show that most of the
research has focused on summarizing the factual information, leaving aside the
affective content, which can nevertheless convey useful information to analyse, monitor,
or support human interactions. In this paper, we propose and evaluate a set of
measures to quantify how much emotion is preserved in dialogue summaries.
Results show that state-of-the-art summarization models do not preserve
the emotional content well in their summaries. We also show that by reducing the
training set to emotional dialogues only, the emotional content is better
preserved in the generated summaries, while the most salient factual
information is retained.
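To illustrate the kind of measure the abstract describes, here is a minimal sketch of one plausible emotion-preservation score: it compares the emotion-label distributions of the source dialogue and its summary with cosine similarity. The `emotion_scores` stub, the label set, and the choice of cosine similarity are assumptions for illustration, not the measures proposed in the paper.

```python
# Hypothetical sketch of an emotion-preservation measure for dialogue summaries.
# `emotion_scores` is assumed to be any off-the-shelf emotion classifier that
# returns a probability per emotion label; it is NOT the metric from the paper.
from math import sqrt

EMOTIONS = ["anger", "fear", "joy", "sadness", "surprise", "neutral"]

def emotion_scores(text: str) -> dict[str, float]:
    """Placeholder: plug in any emotion classifier returning label probabilities."""
    raise NotImplementedError

def emotion_preservation(dialogue: str, summary: str) -> float:
    """Cosine similarity between the emotion distributions of dialogue and summary."""
    d = emotion_scores(dialogue)
    s = emotion_scores(summary)
    num = sum(d[e] * s[e] for e in EMOTIONS)
    den = sqrt(sum(d[e] ** 2 for e in EMOTIONS)) * sqrt(sum(s[e] ** 2 for e in EMOTIONS))
    return num / den if den else 0.0
```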
Cross-domain Voice Activity Detection with Self-Supervised Representations
Voice Activity Detection (VAD) aims at detecting speech segments in an audio
signal, a necessary first step for many of today's speech-based
applications. Current state-of-the-art methods focus on training a neural
network on features derived directly from the acoustics, such as Mel
Filter Banks (MFBs). Such methods therefore require an extra normalisation step
to adapt to a new domain where the acoustics are affected, which may simply be
due to a change of speaker, microphone, or environment. In addition, this
normalisation step is usually a rather rudimentary method with certain
limitations, such as being highly sensitive to the amount of data available
for the new domain. Here, we exploit the crowd-sourced Common Voice (CV)
corpus to show that representations based on Self-Supervised Learning (SSL)
adapt well to different domains, because they are contextualised
representations of speech learned across multiple domains. SSL representations also
achieve better results than systems based on hand-crafted representations
(MFBs) and off-the-shelf VADs, with a significant improvement in cross-domain
settings.
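As a rough illustration of the approach, the sketch below places a small frame-level VAD classifier on top of frozen SSL speech representations. The use of torchaudio's wav2vec 2.0 bundle, the file name, and the single linear head are assumptions for illustration, not the system described above.

```python
# A minimal sketch (not the paper's system): a frame-level VAD head on top of
# frozen SSL speech representations, here assuming torchaudio's wav2vec 2.0 bundle.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE            # pretrained SSL encoder
ssl_model = bundle.get_model().eval()                   # frozen feature extractor

class VADHead(torch.nn.Module):
    """Tiny classifier mapping each SSL frame to a speech/non-speech probability."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = torch.nn.Linear(dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.proj(feats)).squeeze(-1)   # (batch, frames)

vad = VADHead()

waveform, sr = torchaudio.load("utterance.wav")          # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
with torch.no_grad():
    feats, _ = ssl_model.extract_features(waveform)       # list of layer outputs
speech_prob = vad(feats[-1])                              # per-frame speech probability
```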
Can GPT models Follow Human Summarization Guidelines? Evaluating ChatGPT and GPT-4 for Dialogue Summarization
This study explores the capabilities of prompt-driven Large Language Models
(LLMs) like ChatGPT and GPT-4 in adhering to human guidelines for dialogue
summarization. Experiments employed DialogSum (English social conversations)
and DECODA (French call center interactions), testing various prompts,
including prompts from existing literature, prompts derived from human summarization
guidelines, and a two-step prompt approach. Our findings indicate that
GPT models often produce lengthy summaries and deviate from human summarization
guidelines. However, using human guidelines as an intermediate step shows
promise, outperforming direct word-length constraint prompts in some cases. The
results reveal that GPT models exhibit unique stylistic tendencies in their
summaries. While BERTScores for GPT outputs did not decrease dramatically,
suggesting semantic similarity to human references and to specialised pre-trained
models, ROUGE scores reveal grammatical and lexical disparities between
GPT-generated and human-written summaries. These findings shed light on the
capabilities and limitations of GPT models in following human instructions for
dialogue summarization.
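As an illustration of what a two-step prompting pipeline could look like, here is a hedged sketch using the OpenAI Python client. The guideline text, the word limit, and the model name are placeholders, not the prompts or settings evaluated in the paper.

```python
# Illustrative sketch of a two-step prompting strategy (the exact wording, model
# name, and guideline text are placeholders, not the prompts used in the paper).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
GUIDELINES = "Write one concise paragraph covering who spoke, the main issue, and the outcome."

def two_step_summary(dialogue: str, model: str = "gpt-4") -> str:
    # Step 1: draft a summary following the human guidelines.
    draft = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Summarize this dialogue following these guidelines:\n{GUIDELINES}\n\n{dialogue}"}],
    ).choices[0].message.content
    # Step 2: compress the draft to respect a length constraint.
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Shorten this summary to at most 30 words, keeping the key facts:\n{draft}"}],
    ).choices[0].message.content
```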
From speech to facial activity: towards cross-modal sequence-to-sequence attention networks
Multimodal data sources offer the possibility to capture and model interactions between modalities, leading to an improved understanding of underlying relationships. In this regard, the work presented in this paper explores the relationship between facial muscle movements and speech signals. Specifically, we explore the efficacy of different sequence-to-sequence neural network architectures for the task of predicting Facial Action Coding System Action Units (AUs) from one of two acoustic feature representations extracted from speech signals, namely the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) or the Interspeech Computational Paralinguistics Challenge feature set (ComParE). Furthermore, these architectures were enhanced by two different attention mechanisms (intra- and inter-attention) and various state-of-the-art network settings to improve prediction performance. Results indicate that a sequence-to-sequence model with inter-attention can achieve on average an Unweighted Average Recall (UAR) of 65.9 % for AU onset, 67.8 % for AU apex (both eGeMAPS), 79.7 % for AU offset and 65.3 % for AU occurrence (both ComParE) detection over all AUs.
2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP)
DOI: 10.1109/MMSP46350.2019
Funding: BMW Group Research
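The architecture described above can be sketched compactly. The following is a minimal, illustrative PyTorch sequence-to-sequence model with inter- (cross-) attention that maps acoustic feature frames to per-frame AU probabilities; the layer sizes, the feature dimensionality, and the number of AUs are assumptions, not the settings reported in the paper.

```python
# A compact sketch of a sequence-to-sequence model with inter- (cross-) attention
# mapping acoustic feature frames (e.g. eGeMAPS-style descriptors) to per-frame AU
# labels. Dimensions and layer sizes are illustrative, not those of the paper.
import torch
import torch.nn as nn

class Seq2SeqAU(nn.Module):
    def __init__(self, n_feats: int = 88, n_aus: int = 12, hidden: int = 128):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(embed_dim=hidden, kdim=2 * hidden,
                                          vdim=2 * hidden, num_heads=1, batch_first=True)
        self.out = nn.Linear(hidden, n_aus)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        enc, _ = self.encoder(x)            # (B, T, 2*hidden) encoder states
        dec, _ = self.decoder(enc)          # (B, T, hidden) decoder states
        ctx, _ = self.attn(dec, enc, enc)   # inter-attention: decoder attends to encoder
        return torch.sigmoid(self.out(ctx)) # (B, T, n_aus) AU activation probabilities

model = Seq2SeqAU()
frames = torch.randn(2, 300, 88)            # batch of 300 acoustic frames per sequence
au_probs = model(frames)
```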
AV+EC 2015 – the first affect recognition challenge bridging across audio, video, and physiological data
We present the first Audio-Visual+ Emotion recognition Challenge and workshop (AV+EC 2015) aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and physiological emotion analysis. This is the 5th event in the AVEC series, but the very first Challenge that bridges across audio, video and physiological data. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the audio, video and physiological emotion recognition communities, to compare the relative merits of the three approaches to emotion recognition under well-defined and strictly comparable conditions and establish to what extent fusion of the approaches is possible and beneficial. This paper presents the challenge, the dataset, and the performance of the baseline system.
The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism
The INTERSPEECH 2013 Computational Paralinguistics Challenge provides for the first time a unified test-bed for Social Signals such as laughter in speech. It further introduces conflict in group discussions as a new task and picks up on autism and its manifestations in speech. Finally, emotion is revisited as a task, albeit with a broader range of twelve emotional states overall. In this paper, we describe these four Sub-Challenges, the Challenge conditions, the baselines, and a new feature set by the openSMILE toolkit, provided to the participants.
\em Bj\"orn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer}\\
{\em Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, Erik Marchi, }\\
{\em Hugues Salamin, Anna Polychroniou, Fabio Valente, Samuel Kim
AVEC 2016 – Depression, mood, and emotion recognition workshop and challenge
The Audio/Visual Emotion Challenge and Workshop (AVEC 2016) "Depression, Mood and Emotion" will be the sixth competition event aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and physiological depression and emotion analysis, with all participants competing under strictly the same conditions. The goal of the Challenge is to provide a common benchmark test set for multi-modal information processing and to bring together the depression and emotion recognition communities, as well as the audio, video and physiological processing communities, to compare the relative merits of the various approaches to depression and emotion recognition under well-defined and strictly comparable conditions and establish to what extent fusion of the approaches is possible and beneficial. This paper presents the challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.
Adieu Features? End-to-End Speech Emotion Recognition using a Deep Convolutional Recurrent Network
The automatic recognition of spontaneous emotions from speech is a challenging task. On the one hand, acoustic features need to be robust enough to capture the emotional content for various styles of speaking, while on the other, machine learning algorithms need to be insensitive to outliers while being able to model the context. Whereas the latter has been tackled by the use of Long Short-Term Memory (LSTM) networks, the former is still under very active investigation, even though more than a decade of research has provided a large set of acoustic descriptors. In this paper, we propose a solution to the problem of 'context-aware', emotionally relevant feature extraction by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time representation. In this novel work on so-called end-to-end speech emotion recognition, we show that the proposed topology significantly outperforms traditional approaches based on signal processing techniques for the prediction of spontaneous and natural emotions on the RECOLA database.
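A minimal sketch of such an end-to-end CNN + LSTM topology on the raw waveform is given below; the filter sizes, strides, and output targets (e.g. arousal/valence) are illustrative assumptions rather than the exact configuration of the paper.

```python
# Minimal sketch of an end-to-end CNN + LSTM topology operating on the raw
# waveform, in the spirit of the approach described above; the exact filter
# sizes, strides, and output targets are illustrative, not those of the paper.
import torch
import torch.nn as nn

class EndToEndSER(nn.Module):
    def __init__(self, hidden: int = 128, n_outputs: int = 2):  # e.g. arousal/valence
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=16), nn.ReLU(),  # learn filters from raw audio
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=5, stride=2), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.lstm = nn.LSTM(128, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_outputs)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        z = self.conv(wav.unsqueeze(1))      # (B, 128, T') local acoustic features
        z = z.transpose(1, 2)                # (B, T', 128) sequence for the LSTM
        out, _ = self.lstm(z)                # temporal context modelling
        return self.head(out[:, -1])         # utterance-level emotion prediction

model = EndToEndSER()
pred = model(torch.randn(4, 16000))          # batch of 1-second raw waveforms at 16 kHz
```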
- …